A New Semantic Similarity Metric for Solving Sparse Data Problem in Ontology based Information Retrieval System
Abstract
Semantic similarity assessment methods play a central role in many research areas such as psychology, cognitive science, information retrieval, biomedicine and artificial intelligence. This paper discusses the existing semantic similarity assessment methods and identifies how they could be exploited to calculate the semantic similarity of WordNet concepts accurately. Semantic similarity approaches can broadly be classified into three categories: ontology-based approaches (structural approaches), information-theoretic approaches (corpus-based approaches) and hybrid approaches. All of these similarity measures are expected to adhere to certain basic properties of information. The survey revealed the following drawback: information-theoretic measures depend on the corpus, and the presence or absence of a concept in the corpus affects the information content metric. For concepts not present in the corpus, the information content tends to zero or infinity, and hence semantic similarity measures calculated from this metric do not reflect the actual information content of the concept. In this paper we therefore propose a new information content metric which provides a solution to the sparse data problem prevalent in corpus-based approaches. The proposed measure is corpus independent and takes hyponymy and meronymy relations into consideration. Empirical studies of finding the similarity of the R&G data set using the existing Resnik, Lin, and Jiang and Conrath semantic similarity methods with the proposed information content metric are to be carried out. We also propose a new semantic similarity measure based on the proposed information content metric and hypernym relations. The correctness of the proposed information content metric is to be established by comparing the results against the human judgments available for the R&G set. Further, the information content metric used earlier by the Resnik, Lin, and Jiang and Conrath methods may produce better results with corpora other than the Brown corpus; hence the effect of the corpus-based information content metric on alternate corpora is also investigated.

Keywords: Ontology, similarity method, information retrieval, conceptual similarity, taxonomy, corpus based

1. INTRODUCTION

The goal of the information retrieval process is to retrieve information relevant to a given request: ideally all of the relevant information, and none of the non-relevant information. An information retrieval system comprises a representation, a semantic similarity matching function and a query. The representation comprises the abstract descriptions of the documents in the system. The semantic similarity matching function defines how query requests are compared to the stored descriptions in the representation. The percentage of relevant information retrieved depends mainly on the semantic similarity matching function used. The semantic similarity methods in use so far have certain limitations despite their advantages, and no single method replaces all the others. When a new information retrieval system is to be built, several questions arise about which semantic similarity matching function to use. In the last few decades, a large number of semantic similarity methods have been developed.
This paper gives an overall view of the different similarity measuring methods used to compare and find closely similar concepts in an ontology, and discusses the pros and cons of the existing similarity metrics. We present a new approach, independent of any corpus, for finding the semantic similarity between two concepts. In Section 2, a set of basic intuitive properties is defined with which similarity measures should preferably be compatible. Section 3 discusses the various approaches used for similarity computation and their limitations. Section 4 compares the different semantic similarity measures. In Section 5 we introduce a new semantic similarity approach and an algorithm for finding the similarity between all the relations in the WordNet taxonomy. The results based on the new similarity measure are promising.

2. ONTOLOGY SIMILARITY

In this section, a set of intuitive and qualitative properties that a similarity method should adhere to is discussed [20].

Basic Properties. Any similarity measure should be compatible with the basic properties, as they express the exact notion of similarity:
o Commonality Property
o Difference Property
o Identity Property

Retrieval Specific Properties. The similarity measure cannot be symmetrical in an ontology-based information retrieval context: similarity is directly proportional to specialization and inversely proportional to generalization.
o Generalization Property

Structure Specific Properties. The distance represented by an edge should be reduced with increasing depth.
o Depth Property
o Multiple Paths Property

3. APPROACHES USED FOR SIMILARITY COMPUTATION

In this section, we discuss the various similarity methods [20]:
o Path Length Approaches
o Depth-Relative Approaches
o Corpus-based Approaches
o Multiple-Paths Approaches

3.1 Path Length Approach

The shortest path length and the weighted shortest path are the two taxonomy-based approaches for measuring similarity through the inclusion relation.

3.1.1 Shortest Path Length

A simple way to measure semantic similarity in a taxonomy is to evaluate the distance between the nodes corresponding to the items being compared: the shorter the distance, the higher the similarity. Rada et al. [1989][1][14] follow the shortest path length approach, assuming that the number of edges between terms in a taxonomy is a measure of the conceptual distance between the concepts:

dist_Rada(ci, cj) = minimal number of edges in a path from ci to cj

This method yields good results: since the paths are restricted to the ISA relation, path lengths correspond to conceptual distance. Moreover, the experiment was conducted on a specific domain, ensuring hierarchical homogeneity. The drawback of this approach is that it is compatible only with the commonality and difference properties, not with the identity property. A minimal sketch of this edge-counting idea over WordNet is given below.
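The following sketch illustrates Rada-style edge counting using NLTK's WordNet interface; this is an illustrative setup assumed here, not the authors' original implementation (Rada et al. worked over a medical taxonomy). It requires the nltk package with the wordnet data downloaded.

```python
# Minimal sketch: Rada-style edge counting over WordNet via NLTK.
# Assumes: pip install nltk, plus nltk.download('wordnet').
from nltk.corpus import wordnet as wn

car = wn.synset('car.n.01')
bicycle = wn.synset('bicycle.n.01')

# dist_Rada: minimal number of ISA edges between the two concepts.
dist = car.shortest_path_distance(bicycle)

# NLTK's path_similarity folds the same distance into a score in
# (0, 1] as 1 / (1 + dist), so shorter paths mean higher similarity.
sim = car.path_similarity(bicycle)

print(f"edges: {dist}, path similarity: {sim:.3f}")
```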
3.1.2 Weighted Shortest Path Length

This is another simple edge-counting approach in which weights are assigned to the edges; the weighted shortest path measure is a generalization of the shortest path length, and it obviously supports the commonality and difference properties. Let σ denote the similarity of an immediate specialisation and γ the similarity of an immediate generalisation. A path P = (p1, ..., pn) connects x = p1 and y = pn, where for each i either pi ISA pi+1 or pi+1 ISA pi. Given such a path P, set s(P) to the number of specializations and g(P) to the number of generalizations along P:

s(P) = |{i | pi ISA pi+1}|      (1)
g(P) = |{i | pi+1 ISA pi}|      (2)

If P1, ..., Pm are all the paths connecting x and y, then the degree to which y is similar to x, sim_WSP(x, y), is calculated as the maximal product of the weights along the paths between x and y:

sim_WSP(x, y) = max_{j=1,...,m} { σ^s(Pj) * γ^g(Pj) }      (3)

For x = y the trivial path has s(P) = g(P) = 0, giving similarity 1. The weighted shortest path length thus overcomes the limitation of the shortest path length: the measure takes the generalization property into account and achieves the identity property.

3.2 Depth-Relative Approaches

Even though the edge-counting method is simple, it is limited by its assumption of uniform distances on the edges of the taxonomy. Depth-relative approaches support the structure specific property, under which the distance represented by an edge should be reduced with increasing depth.

3.2.1 Depth Relative Scaling

In his depth-relative scaling approach, Sussna [1993][2] defines two edges representing inverse relations for each edge in a taxonomy. The weight attached to a relation r is a value in the range [min_r, max_r]. The point in the range for a relation r from concept c1 to c2 depends on the number n_r(c1) of edges of the same type leaving c1, denoted the type-specific fanout factor:

w(c1 ->r c2) = max_r - (max_r - min_r) / n_r(c1)

The two inverse weights are averaged and scaled by the depth d of the edge in the overall taxonomy. The distance between adjacent nodes c1 and c2 is computed as:

dist_Sussna(c1, c2) = [w(c1 ->r c2) + w(c2 ->r' c1)] / 2d      (4)

where r is the relation that holds between c1 and c2, and r' is its inverse. The semantic distance between two arbitrary concepts c1 and c2 is computed as the sum of the distances between the pairs of adjacent concepts along the shortest path connecting c1 and c2.

3.2.2 Conceptual Similarity

Wu and Palmer [1994][3] propose a measure of semantic similarity on the semantic representation of verbs in computer systems and study its impact on lexical selection problems in machine translation. They define the conceptual similarity between a pair of concepts c1 and c2 as:

sim_Wu&Palmer(c1, c2) = 2*N3 / (N1 + N2 + 2*N3)      (5)

where c3 is the least upper bound of c1 and c2, N1 is the number of nodes on the path from c1 to c3, N2 is the number of nodes on the path from c2 to c3, and N3 is the number of nodes on the path from c3 to the most general concept.

3.2.3 Normalised Path Length

Leacock and Chodorow [1998][4] proposed measuring semantic similarity as the shortest path through the ISA hierarchies for nouns in WordNet. This measure determines the semantic similarity between two synsets (concepts) by finding the shortest path and scaling it by the depth of the taxonomy:

sim_Leacock&Chodorow(c1, c2) = -log(Np(c1, c2) / 2D)      (6)

where Np(c1, c2) denotes the shortest path between the synsets (measured in nodes) and D is the maximum depth of the taxonomy. Both depth-relative measures are sketched below.
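A minimal sketch of the Wu-Palmer and Leacock-Chodorow measures using NLTK's built-in implementations, assumed here for illustration; NLTK's variants follow the formulations above up to small implementation details such as node versus edge counting.

```python
# Minimal sketch: depth-relative measures via NLTK.
# Assumes nltk.download('wordnet') has been run.
from nltk.corpus import wordnet as wn

c1, c2 = wn.synset('dog.n.01'), wn.synset('cat.n.01')

# Wu & Palmer (Eq. 5): depth of the least upper bound relative
# to the depths of both concepts.
print(f"Wu & Palmer:        {c1.wup_similarity(c2):.3f}")

# Leacock & Chodorow (Eq. 6): -log of the shortest path length
# scaled by twice the maximum taxonomy depth.
print(f"Leacock & Chodorow: {c1.lch_similarity(c2):.3f}")
```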
3.3 Corpus-based Approach

The knowledge disclosed by corpus analysis is used to intensify the information already present in the ontologies or taxonomies. This section presents approaches that incorporate corpus analysis as an additional, qualitatively different knowledge source.

3.3.1 Information Content

Rather than counting edges in the shortest path, this method selects the maximum information content of the least upper bound of two concepts. Resnik [1999][5] argued that a widely acknowledged problem with edge-counting approaches is that they typically rely on the notion that edges represent uniform distances. Resnik's measure instead uses information content, derived from corpus knowledge about the use of senses, to express non-uniform distances. Let C denote the set of concepts in a taxonomy that permits multiple inheritance, and associate with each concept c ∈ C the probability p(c) of encountering an instance of c. For a pair of concepts c1 and c2, their similarity is defined as:

sim_Resnik(c1, c2) = max_{c ∈ S(c1,c2)} [-log p(c)]      (7)

where S(c1, c2) is the set of least upper bounds of c1 and c2 in the taxonomy, and p(c) is monotonically non-decreasing as one moves up the taxonomy: p(c1) ≤ p(c2) if c1 is a c2. The similarity between two words w1 and w2 is then computed as:

wsim_Resnik(w1, w2) = max_{c1 ∈ s(w1), c2 ∈ s(w2)} sim(c1, c2)      (8)

where s(wi) is the set of possible senses of the word wi. Resnik [1999] describes an implementation based on information content using WordNet's taxonomy of noun concepts [Miller, 1990][6]. The information content of each concept is calculated from noun frequencies:

freq(c) = Σ_{w ∈ words(c)} count(w)

where words(c) is the set of words whose senses are subsumed by concept c, and

p(c) = freq(c) / N

where N is the total number of nouns. The major drawback of the information content approach is that it fails to comply with the generalization property, due to symmetry.

3.3.2 Jiang and Conrath's Approach (Hybrid Method)

Jiang and Conrath [1997][7] proposed a method to synthesize edge-counting methods and information content into a combined model by adding the information content as a corrective factor. The edge weight between a child concept cc and a parent concept cp is calculated from factors such as the local density in the taxonomy, the node depth and the link type:

wt(cc, cp) = (β + (1-β) * Ē/E(cp)) * ((d(cp)+1)/d(cp))^α * LS(cc, cp) * T(cc, cp)      (9)

where d(cp) is the depth of the concept cp in the taxonomy, E(cp) is the number of children of cp (the local density), Ē is the average density in the entire taxonomy, LS(cc, cp) is the strength of the edge between cc and cp, and T(cc, cp) is the edge relation/type factor. The parameters α ≥ 0 and β, 0 ≤ β ≤ 1, control the influence of concept depth and density, respectively. The strength of a link LS(cc, cp) between parent and child concepts is proportional to the conditional probability p(cc|cp) of encountering an instance of the child concept cc given an instance of the parent concept cp:

LS(cc, cp) = -log p(cc|cp)

Resnik assigned probabilities to the concepts such that p(cc ∩ cp) = p(cc), because any instance of a child concept cc is also an instance of the parent concept cp. Then:

p(cc|cp) = p(cc ∩ cp) / p(cp) = p(cc) / p(cp)

If IC(c) denotes the information content of concept c, then:

LS(cc, cp) = IC(cc) - IC(cp)

Jiang and Conrath then defined the semantic distance between two nodes as the summation of the edge weights along the shortest path between them [J. Jiang, 1997]:

dist_Jiang&Conrath(c1, c2) = Σ_{c ∈ path(c1,c2), c ≠ LSuper(c1,c2)} wt(c, parent(c))      (10)

where path(c1, c2) is the set of all nodes along the shortest path between c1 and c2, parent(c) is the parent node of c, and LSuper(c1, c2) is the lowest superordinate (least upper bound) on the path between c1 and c2. Both corpus-based measures are sketched below.
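A minimal sketch of the Resnik and Jiang-Conrath measures using NLTK and its precomputed Brown-corpus information-content file, an assumed setup for illustration. Note that a concept with zero frequency in the corpus has infinite information content, which is exactly the sparse data problem this paper targets.

```python
# Minimal sketch: corpus-based IC measures via NLTK.
# Assumes nltk.download('wordnet') and nltk.download('wordnet_ic').
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

# Concept frequencies counted over the Brown corpus.
brown_ic = wordnet_ic.ic('ic-brown.dat')

c1, c2 = wn.synset('dog.n.01'), wn.synset('cat.n.01')

# Resnik (Eq. 7): IC of the most informative least upper bound.
print(f"Resnik: {c1.res_similarity(c2, brown_ic):.3f}")

# Jiang & Conrath: NLTK inverts the simplified distance
# IC(c1) + IC(c2) - 2*IC(LSuper(c1, c2)) into a similarity.
print(f"J&C:    {c1.jcn_similarity(c2, brown_ic):.3f}")
```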
Jiang and Conrath's approach made information content compatible with the basic properties and the depth property, but not with the generalization property.

3.3.3 Lin's Universal Similarity Measure

Lin [1997; 1998][8][13] defines a measure of similarity claimed to be both universally applicable to arbitrary objects and theoretically justified, deriving its generality from a set of assumptions. Lin's information-theoretic definition of similarity builds on the three basic properties: commonality, difference and identity. In addition, he assumed that the commonality between A and B is measured by the amount of information contained in the proposition that states the commonalities between them, formally:

I(common(A, B)) = -log p(common(A, B))

where I(s) is the negative logarithm of the probability of the proposition, as described by Shannon [1949]. The difference between A and B is measured by:

I(description(A, B)) - I(common(A, B))

where description(A, B) is a proposition about what A and B are. Lin proved that the similarity between A and B is measured by:

sim_Lin(A, B) = log p(common(A, B)) / log p(description(A, B))      (11)

the ratio between the amount of information needed to state the commonality of A and B and the information needed to describe them fully. For two concepts in a taxonomy, Lin's similarity is:

sim_Lin(c1, c2) = 2 * log p(LUB(c1, c2)) / (log p(c1) + log p(c2))      (12)

where LUB(c1, c2) is the least upper bound of c1 and c2 and p(x) is estimated from statistics on a sense-tagged corpus. This approach complies with the set of basic properties and the depth property, but fails to comply with the generalization property as it is symmetric; see the sketch below.
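A minimal sketch of Lin's taxonomy measure via NLTK's implementation (assumed setup as above). The symmetry is visible directly in Eq. (12): swapping c1 and c2 leaves the value unchanged.

```python
# Minimal sketch: Lin similarity (Eq. 12) via NLTK.
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')
c1, c2 = wn.synset('dog.n.01'), wn.synset('cat.n.01')

# 2 * IC(LUB(c1, c2)) / (IC(c1) + IC(c2)); symmetric by
# construction, hence no generalization property.
print(f"Lin: {c1.lin_similarity(c2, brown_ic):.3f}")
```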
3.4 Multiple-Paths Approaches

This family of approaches addresses the problem with single-path approaches: a single path, as a measure of similarity, fails to truly express similarity whenever the ontology allows multiple inheritance. A multiple-paths approach measures similarity by taking into account all the semantic relations in the ontology, considering more than one path between concepts. Attributes should influence the measure of similarity, so that two concepts sharing the same attribute are considered more similar than concepts not having this particular attribute.

3.4.1 Medium-Strong Relations

Hirst and St-Onge [Hirst and St-Onge, 1998; St-Onge, 1995][9][15] distinguish extra-strong, strong and medium-strong relations between nouns in WordNet. The extra-strong relation holds only between a word and its literal repetition.

3.4.2 Generalised Weighted Shortest Path

The principle of weighted path similarity can be generalized by introducing similarity factors for the semantic relations. However, there does not seem to be an obvious way to differentiate based on direction, so we can generalize simply by introducing a single similarity factor and simplifying to bidirectional edges. This method solves the symmetry problem by introducing weighted edges.

3.4.3 Shared Nodes

This approach overcomes the limitation of the single path length approach by considering multiple paths when measuring similarity. The shared nodes approach, with the similarity function discussed above, complies with all the defined properties.

3.4.4 Weighted Shared Nodes Similarity

When deriving similarity from the notion of shared nodes, not all nodes are equally important. Assigning weights to edges is therefore important, as it generalizes the measure so that it can be used for different domains with different semantic relations. The weighted shared nodes measure complies with all the defined properties.

4. COMPARISON OF DIFFERENT SIMILARITY MEASURES

In this section we discuss the results of comparing the measures to human similarity judgments. The first human similarity judgment study was carried out by Rubenstein and Goodenough [1965][11], using two groups totaling 51 subjects to perform synonymy judgments on 65 pairs of nouns; it has since been the basis for comparing similarity measures. Miller and Charles [1991][12] repeated Rubenstein and Goodenough's original experiment on a subset of 30 noun pairs from the original list of 65, with ten pairs from the high level of synonymy, ten from the middle level and ten from the low level. The correlation between the two experiments is 0.97. A further experiment replicated the Miller and Charles experiment with the 30 concept pairs plus an additional ten new compound concept pairs, and human judgments were collected for all 40 pairs. The correlations for the measures of Resnik; Jiang and Conrath; Lin; Hirst and St-Onge; and Leacock and Chodorow are shown in Table 2 [Budanitsky, 2001][18][19][20]. A sketch of this evaluation protocol follows the table.

Table 2: Correlation between different similarity measures and human similarity judgments from the Miller and Charles experiment

Approach    Correlation
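The evaluation protocol above, scoring word pairs with a measure and correlating the scores against human ratings, can be sketched as follows. The pairs and ratings here are illustrative stand-ins, not the actual Rubenstein & Goodenough / Miller & Charles data, and the first WordNet sense is taken for each word as a simplification.

```python
# Minimal sketch of the evaluation protocol: correlate a
# similarity measure's scores with human judgments.
# The ratings below are illustrative placeholders only.
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic
from scipy.stats import pearsonr

brown_ic = wordnet_ic.ic('ic-brown.dat')

pairs = [
    ('car.n.01',     'automobile.n.01', 3.9),  # near-synonyms
    ('journey.n.01', 'voyage.n.01',     3.8),
    ('coast.n.01',   'shore.n.01',      3.7),
    ('coast.n.01',   'forest.n.01',     0.4),  # weakly related
]

human, machine = [], []
for s1, s2, rating in pairs:
    human.append(rating)
    machine.append(wn.synset(s1).lin_similarity(wn.synset(s2), brown_ic))

r, _ = pearsonr(human, machine)
print(f"Pearson correlation with human ratings: {r:.3f}")
```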
Similar Resources
A New Similarity Measure Based on Item Proximity and Closeness for Collaborative Filtering Recommendation
Recommender systems utilize information retrieval and machine learning techniques for filtering information and can predict whether a user would like an unseen item. User similarity measurement plays an important role in collaborative filtering based recommender systems. In order to improve accuracy of traditional user based collaborative filtering techniques under new user cold-start problem a...
Public Transport Ontology for Passenger Information Retrieval
Passenger information aims at improving the user-friendliness of public transport systems while influencing passenger route choices to satisfy transit user’s travel requirements. The integration of transit information from multiple agencies is a major challenge in implementation of multi-modal passenger information systems. The problem of information sharing is further compounded by the multi-l...
Query Architecture Expansion in Web Using Fuzzy Multi Domain Ontology
Due to the increasing web, there are many challenges to establish a general framework for data mining and retrieving structured data from the Web. Creating an ontology is a step towards solving this problem. The ontology raises the main entity and the concept of any data in data mining. In this paper, we tried to propose a method for applying the "meaning" of the search system, but the problem ...
Development of a Combined System Based on Data Mining and Semantic Web for the Diagnosis of Autism
Introduction: Autism is a nervous system disorder, and since there is no direct diagnosis for it, data mining can help diagnose the disease. Ontology as a backbone of the semantic web, a knowledge database with shareability and reusability, can be a confirmation of the correctness of disease diagnosis systems. This study aimed to provide a system for diagnosing autistic children with a combinat...
Developing a BIM-based Spatial Ontology for Semantic Querying of 3D Property Information
With the growing dominance of complex and multi-level urban structures, current cadastral systems, which are often developed based on 2D representations, are not capable of providing unambiguous spatial information about urban properties. Therefore, the concept of 3D cadastre is proposed to support 3D digital representation of land and properties and facilitate the communication of legal owners...
Journal: IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 3, No. 11
Publication date: May 2010